Project Presentation
Group #6
Team members
- Zhuoyan Guo
- Xue Qin
- Congfei Yin
Predicting Aids Patient Survival with Different Treatment
Introduction
Background
HIV, or human immunodeficiency virus, is a virus that attacks the immune system by targeting white blood cells, making the body more vulnerable to illnesses like tuberculosis, infections, and certain cancers. Without treatment, HIV can progress over time to its most severe stage, AIDS (acquired immunodeficiency syndrome). The virus is transmitted through bodily fluids such as blood, breast milk, semen, and vaginal fluids, and can also be passed from mother to child.1
The symptom and signs are as follows:
fever
headache
rash
sore throat
Data sources
It’s a public dataset contains records of patients who were diagnosis with AIDS.
The data contained 2139 intances, include integer and catugoriacal variables
Project Goal
The goal of this project is to predict Aids patient treatment failure (death) or survival (censoring) with a certain window of time by comparing two different treatments and looking at the clinical factors such as patient demographics, treatment indicators, and clinical factors such as CD4 counts and Karnofsky scores.
Methods
Survival Analysis
Logistic Regression
Logic Regression
Decision Tree
Methodology that will be included in this project would be survival analysis and machine learning methods. Kaplan-Meier Estimator and survival curve will help to compare the difference of using two treatments by visualizing probability of survival over time. The prediction models such as Linear Regression, Logistic Regression and Decision Tree will be used to predict the patients’ survival outcome. The evaluation of models will also proceed during the model development.
Questions to address
- How do patient demographics (e.g., age, gender, baseline health status) influence the effectiveness of different treatment methods on survival outcomes?
- What clinical features (e.g., CD4 counts, Karnofsky score) significantly impact survival outcomes for patients undergoing different treatments?
- Which combination of CD4 counts and Karnofsky scores provides the most accurate prediction of patient outcomes for each treatment method?
- Which treatment method would be better?
- How do patient demographics (e.g., age, gender, baseline health status) influence the effectiveness of different treatment methods on survival outcomes?
- What are the most important indicators/features that better give the diagnois result?
- What are the most significant clinical factors influencing patient survival when considering CD cell counts and treatment methods?
- what is the best model to predict the diagnosis result?
Exploratory Analysis
Statistic Summary
Average survival time is around 879.1 days
The age of patients range from 12 to 70 years old and the average age is 35.25 years old.
The average weight is range from 31 kg to 159.94 kg and the average weight is 75.13kg
Feature Correlation Heatmap
- The warmer color shows the positive correlated relationship
- variable cd820 (CD8 at 20+/-5 weeks) and cd80 (CD8 at baseline) are highly correlated
- cd420(CD4 at 20+/-5 weeks) and cd40(CD4 at baseline)
Distribution of Counts of Treatment Group
- Each treatment group are almost equally distributed
- Would not cause any bias about the final treatment results
Gender proportion by treatment group
- The number of male patients are much larger than female patients
- Every treatment is balanced distributed
Relationship between CD4 and CD8 counts
- No big difference between CD4 and CD8 in each treatment group
- No clear trend that shows the relationship between CD4 and CD8
Pie Chart for Diagnoisis Outcome distribution
Correlation Heatmap of Karnofsky score
- The positive correlation between the Karnofsky score and CD4 counts suggests that patients with better functional abilities
Survival Analysis
By applying Survival Analysis, we want to answer the following questions:
What are the survival probabilities associated with different treatment methods?
How do clinical indicators or demographic factors affect survival outcomes?
We focus on estimating survival probabilities and comparing different groups using Kaplan-Meier curves and the Cox Proportional Hazards model.
Kaplan-Meier
- Analyze Survival Probabilities Across Treatments
trt=0 showing a significant worse outcomes compared to the other three treatments indicating by the p-value < 0.0001.
Patients who toke trt=1, trt=2, and trt=3 shows similar survival curves.
- Focus on Effective Treatments: Survival Differences in trt=1,2,3
- The survival probabilities across these treatments are nearly indistinguishable according to the p-value = 0.4.
- This similarity implies that external factors – other than treatments themselves –may play a role in shaping the survival outcomes.
- Explore Factors That Affect Survival Probabilities Using the Bootstrap Method
Gender:
- For trt=0, gender significantly impacts survival probabilities (p-value = 0.024), with female showing higher survival probabilities.
- Gender differences are less significant for trt=1, trt=2, and trt=3 (p-values = 0.35, 0.061, and 0.79).
ARV History:
- ARV history means the whether the patient has ever taken the antiretroviral treatments, which is the treatment of HIV/AIDS involves treatment of opportunistic infections, prophylaxis, and antiretroviral (ARV) therapy.
- Using trt=1, antiretroviral history (naive vs. experienced) does not significantly impact survival outcomes for patients.
- Using trt=0, trt=2, and trt=3, ARV history has a substantial impact on survival probabilities. Patients with prior ARV treatment show significantly lower survival probabilities compared to naive patients due to the impact of drug resistance.
Symptom
- For trt=0 and trt=3, the p-values (0.11 and 0.38) indicate no statistically significant difference in survival probabilities between symptomatic and asymptomatic patients.
- For trt=1, the p-value (0.0039) shows a significant difference, where asymptomatic patients exhibit slightly better survival probabilities compared to symptomatic patients.
Karnofsky Score
- Karnofsky score measures a patient’s functional status on a scale from 0 to 100, with higher scores indicating better physical capability and ability to carry out daily activities.
- There is no significant difference in survival probabilities for patients using treatment 0, 1, and 3.
- Karnofsky score quartiles significantly impact survival probabilities (p = 0.038), with higher quartiles (Q3 and Q4) associated with improved survival outcomes if patients receiving trt=2.
CD4 at Baseline
- The CD4 cell count is a critical measure of immune system health, which reflects the number of CD4 T-helper cells in the blood, which are essential for immune function.
- There is a consistent and significant impact of CD4 quartiles on survival probabilities across all treatment groups: survival probabilities significantly improve with higher CD4 quantiles, saying that immune status is a key determinant of survival outcomes for all groups and patients with higher CD4 consistently demonstrate better survival probabilities.
Cox Proportional Hazard Models
- The Kaplan-Meier (KM) curves provide a fundational view of survival probabilities by visualizing unadjusted survival outcomes across different strata.
- The Cox proportional hazards model is a semi-parametric approach that allows for the simultaneous inclusion of multiple covariates. And it can adjust for confounding factors, ensuring a clearer understanding of each variable’s independent effect on survival.
- Cox Model with Interaction Terms
Logistic Regression
Model Analysis
The dataset we are using in this analysis are the filtered and scaled version of the original dataset, we first scaled all continues variables, and then set the landmark at time = 1000, we kept every observation failed before time=1000 and everyone who is still alive at time-1000, sample 503 observations out of each groups of data to prevent issues of the unbalance dataset will bring to the prediction model.
At time = 1000, we have a similar sizes of patients with Cid=0 and Cid = 1. The original dataset has 2139 observations, 24 variables, and the modifed data has 1006 observation.
- Confusion matrix
| Prediction vs Actual | 0 | 1 |
|---|---|---|
| 0 | 120 | 46 |
| 1 | 30 | 104 |
- Model Results
Backward Selection
| Metric | Value |
|---|---|
| Accuracy | 74.67% |
| Sensitivity | 0.8000 |
| Specificity | 0.6933 |
| p-value | < 2e-16 |
Feature Impoartance
karnof, symptom, offtrt, cd420, cd820.
offtrt and cd420 are strong predictors of survival, showing a negative impact when their respective conditions worsen. cd820 show positive associations with survival, highlighting potential protective effects or favorable prognostic indicators. These insights could inform clinical interventions by emphasizing the need to prioritize treatment adherence and monitor biomarkers like cd420 and cd820.
Decision Tree
Model Analysis
- Confusion matrix
| Prediction vs Actual | 0 | 1 |
|---|---|---|
| 0 | 123 | 57 |
| 1 | 27 | 93 |
- Model Results
| Metric | Value |
|---|---|
| Accuracy | 72% |
| Sensitivity | 0.8200 |
| Specificity | 0.6200 |
| p-value | 7.335e-15 |
Feature Impoartance
cd420, offtrt, age, preanti, cd40, wtkg, cd820
Random Forest
Model Analysis
After feature selection and hyperparameter tuning, the final model achieved an improved accuracy of 70.33%. Sensitivity (76%) improved, indicating better performance in identifying true positives. The specificity (64.00%) remained consistent, reflecting less reliable identification of true negatives.
- Confusion matrix
| 0 | 1 | |
|---|---|---|
| 0 | 114 | 53 |
| 1 | 36 | 97 |
- Model Results
| Metric | Value |
|---|---|
| Accuracy | 0.7033 |
| Sensitivity | 0.7600 |
| Specificity | 0.6467 |
| p-value | 6.97e-13 |
Feature Impoartance
offtrt, cd420, preanti, age, homo, z30, strat, gender, cd40, cd80
Conclusion
Answer to Questions
- How do patient demographics (e.g., age, gender, baseline health status) influence the effectiveness of different treatment methods on survival outcomes?
- Gender
- Antiretroviral History
- What clinical features (e.g., CD4 counts, Karnofsky score) significantly impact survival outcomes for patients undergoing different treatments?
- Symptom
- Karnofsky Score
- CD4 counts at the baseline
- Which treatment method would be better?
- trt=0 is the worst while other three treatments don’t have significant difference.
Tabel: Model Comparasion
| Models | Accuracy Results | Important Features |
|---|---|---|
| Logistic Regression | 0.7467 | karnof, symptom, offtrt, cd420, cd820 |
| Decision Tree | 0.72 | cd420, offtrt, age, preanti, cd40, wtkg, cd820 |
| Random Forest | 0.7033 | offtrt cd420 preanti age homo z30 strat gender cd40 cd80 |
- What are the most important indicators/features that better give the diagnois result?
- gender
- age
- offtrt: indicator of off-trt before 96+/-5 weeks (0=no,1=yes)
- cd420: CD4 at 20+/-5 weeks
- what is the best model to predict the diagnosis result?
- Logistic Regression has the highest prediction accuracy in predicting the diagnosis result.